Learning outcomes: PCA, interactive 3D plots

I understand that maybe these detailed examples can feel more confusing than helpful. Therefore, should you feel frustrated about all my data wrangling and modification, don’t stress. Copy-paste the code and fast-forward to where the plotting starts. That’s totally fine.

This set of exercises is based on untargeted metabolomics data (“metabolome”) from three sponge species ( Geodia barretti, Stryphnus fortis and Weberella bursa) but as you’ll see in a minute, this data is really just huuuge matrices of numbers. Unlike in the ‘targeted metabolomics’ exercise, we really don’t know which compounds we’re looking at.

The metabolome was acquired in positive and negative ionization mode on an Acquity I-Class Ultra Performance Liquid Chromatography UPLC (HILIC and RP) coupled to a G2S Synapt Q-TOF with an electrospray ionization (ESI) ion source. This means, in total we have four data set (HILIC positive, HILIC negative, RP positive and RP negative). I processed the raw data with XCMS and CAMERA in R. To give you an idea, these are the commands for the HILIC positive data set.

The raw data is quite large, processing takes several days and I think this is a bit too specific, so we’ll just explore the data together if that’s okay. There’s still plenty to do.

# HILIC 
# setwd() 
library(xcms)
xset <- xcmsSet(method = "centWave", ppm = 8, peakwidth = c(5, 45), noise = 2000)
save(xset, file = "HILIC_pos_xset_20190417.Rda")
# load(file='HILIC_pos_xset_20190417.Rda') #When resuming after a break
xset <- group(xset)
xset2 <- retcor(xset, method = "obiwarp", response = 10, plottype = "deviation")
xset2 <- group(xset2)
xset3 <- fillPeaks(xset2)
save(xset3, file = "HILIC_pos_xset3_20190417.Rda")
reporttab <- diffreport(xset3, "Gb", "Sf", "Gb_Sf_HILIC_pos_20190417", 10)

library(CAMERA)
xsa <- xsAnnotate(xset3)
xsaF <- groupFWHM(xsa, perfwhm = 0.3)
xsaC <- groupCorr(xsaF, cor_eic_th = 0.7)
xsaFI <- findIsotopes(xsaC)
rules <- read.csv("data/rules_jan_pos.csv", header = T, sep = ",")
xsaFA <- findAdducts(xsaFI, polarity = "positive", rules = rules)
# write.csv(getPeaklist(xsaFA), file='HILIC_pos_20190417.csv')

The four data sets are called “HILIC_pos_…”, “HILIC_neg_…”, “RP_pos_..” and “RP_neg_…” and I provided you with the actual output of the above code, there’s no secret sauce in between. You can load just the HILIC positive, the others are all fairly similar. This data goes with the same metadata as before (metadata_metabolomics.csv).

HILIC_pos <- read.csv("data/HILIC_pos_20190417.csv")
# HILIC_neg <- read.csv('data/HILIC_neg_20190421.csv') RP_pos <-
# read.csv('data/RP_pos_20190421.csv') RP_neg <-
# read.csv('data/RP_neg_20190422.csv')

# for later
meta_data <- read.csv("data/metadata_metabolomics.csv", header = T, sep = ";")

Have a look at the data frame. The first 12 columns are information about the data, which is present in the subsequent columns, called “IE_…”. Every row is a feature (general term for ion or adduct). A feature is characterized by a m/z (mass over charge) in combination with a unique (range of) retention time rt. So for every sample, we have a value for each feature, that means: How much of unknown compound with m/z x and rt y is present in each sample. The retention time is measured in seconds. Overall, among these “IE_…” samples there is the data from all samples plus a group of QC (pooled Quality control) samples.

To start we

To be able to easily compare, I copied the HILIC_pos data frame and called it “raw_ peaks”.

raw_peaks <- HILIC_pos

# Removing features eluting in the void, i.e. I only keep rows with an rt
# greater than 45
raw_peaks <- raw_peaks[raw_peaks$rt > 45, ]

# Removing features with a CV > 30%
f <- which(colnames(raw_peaks) == "IE_20170918_02001")  # finding the first QC sample, HILIC pos
l <- which(colnames(raw_peaks) == "IE_20170918_07601")  # finding the last QC sample, HILIC pos

# The coefficient of variation is defined as the standard deviation divided by
# the mean. So this is calculated for every feature, i.e. every row across the
# QC samples. I do it step-by-step.

# apply() is a handy function that allows you, as the name indicates, to apply
# a function across either rows (1) or columns (2). You can see the functions I
# apply at the end. sd stands for standard deviation, mean is the mean. I
# specify 'only across the QC samples' [, f:l].
raw_peaks["SD"] <- apply(raw_peaks[, f:l], 1, sd)
raw_peaks["MEAN"] <- apply(raw_peaks[, f:l], 1, mean)

# Here I calculate the coefficient of variation
raw_peaks["CV"] <- raw_peaks$SD/raw_peaks$MEAN

# Now I only keep those rows with a coefficient of variation smaller than 0.3
# or 30%
raw_peaks <- raw_peaks[raw_peaks$CV < 0.3, ]
HILIC_pos_cleaned <- raw_peaks
# Now we have a filtered data set. You see we've lost some data (there are
# fewer rows in 'HIILIC_pos_cleaned' then in 'HIILIC_pos'). If you want to, you
# can scroll to the end of the data frame and see the new columns withsd, mean
# and cv.

For the multivariate analyses, we need only numerical data. And it needs to be in the proper format, that is rows are observations (the samples) and columns are variables (the features). This is a bit boring and tedious and as I repeated this several times across the different data sets and all kinds of analyses I did, I wrapped all the commands in a function.

I’ve annotated it for you but I don’t expect you to write this yourself. But it might be interesting to see it’s no big deal. To use it, you need to execute the function that I’ve called “formatting”, it will appear to your right in “Functions” and the single command at the bottom (hilic_pos <- formatting(HILIC_pos_cleaned, meta_data, 6, "LC.MS.HILIC.positive")) executes all of that at once.

library(tidyverse)

formatting <- function(metabolome, meta_data, r, my_colnames) {
    formatted <- metabolome
    formatted <- formatted[, 14:(dim(formatted)[2] - r)]  # delete the first couple of columns
    formatted <- data.frame(t(formatted))  # transpose the data
    formatted["ID"] <- rownames(formatted)  # a couple of steps to replace the sample names with the unified_ID I used across several analyses
    formatted["unified_ID"] <- meta_data$unified_ID[match(formatted$ID, meta_data[[my_colnames]])]
    formatted["filter"] <- str_sub(formatted$unified_ID, 1, 2)
    formatted <- formatted[!formatted$filter == "QC", ]  # remove QC samples
    formatted <- na.omit(formatted)  # remove missing data/NA's
    # formatted$filter <- NULL
    formatted$ID <- NULL  # delete superfluous column
    formatted <- formatted[order(formatted$unified_ID), ]  # sort table
    rownames(formatted) <- formatted$unified_ID  # set IDs as rownames
    formatted$unified_ID <- NULL  # delete superfluous column
    formatted$filter <- NULL
    return(formatted)  # return the modified data frame
}

hilic_pos <- formatting(HILIC_pos_cleaned, meta_data, 6, "LC.MS.HILIC.positive")
# hilic_pos[1:5, 1:5] # to see the beginning of the data frame

# to not get confused woth our other data sets, let's delete them
rm(HILIC_pos_cleaned, HILIC_pos)

This is now a data set ready for analyses.

PCA

A PCA is a dimensionality reduction method for data exploration. We have a total of 44 samples (across three species) but 3508 variables (features or metabolites). The PCA helps to linearly combine the variables to account for most variation in that data in new artificial “principal components” (PC). The first, PC1, should explain most of the variation in the data.

The function to perform a PCA is called prcomp() and very straight forward.

# The PCA itself
sponge_pca <- prcomp(hilic_pos, scale = T)

# Save the summary
pca_summary <- summary(prcomp(hilic_pos, scale = TRUE))

# The 'proportion of variance' is how much of the total variation is explained
# by a PC.
pca_summary$importance
##                             PC1      PC2      PC3      PC4      PC5      PC6
## Standard deviation     28.15360 25.16374 13.18239 11.61775 10.97991 10.34581
## Proportion of Variance  0.22601  0.18056  0.04955  0.03849  0.03438  0.03052
## Cumulative Proportion   0.22601  0.40657  0.45612  0.49461  0.52898  0.55950
##                             PC7      PC8      PC9     PC10     PC11     PC12
## Standard deviation     10.23360 9.649427 9.096311 8.562855 8.300362 8.232565
## Proportion of Variance  0.02986 0.026550 0.023590 0.020910 0.019650 0.019330
## Cumulative Proportion   0.58937 0.615920 0.639510 0.660420 0.680060 0.699390
##                            PC13     PC14     PC15     PC16     PC17     PC18
## Standard deviation     7.923549 7.804373 7.505818 7.330147 7.265491 7.116605
## Proportion of Variance 0.017900 0.017370 0.016060 0.015320 0.015050 0.014440
## Cumulative Proportion  0.717290 0.734660 0.750720 0.766040 0.781100 0.795540
##                            PC19     PC20     PC21     PC22     PC23     PC24
## Standard deviation     6.882674 6.639195 6.556044 6.453196 6.179417 6.054724
## Proportion of Variance 0.013510 0.012570 0.012260 0.011870 0.010890 0.010450
## Cumulative Proportion  0.809040 0.821610 0.833870 0.845740 0.856630 0.867090
##                            PC25     PC26     PC27     PC28     PC29     PC30
## Standard deviation     5.969914 5.808419 5.675967 5.601095 5.566878 5.307529
## Proportion of Variance 0.010160 0.009620 0.009190 0.008950 0.008840 0.008030
## Cumulative Proportion  0.877250 0.886870 0.896050 0.905000 0.913840 0.921870
##                            PC31     PC32    PC33     PC34     PC35     PC36
## Standard deviation     5.275441 5.253955 5.13989 5.027701 4.996068 4.880103
## Proportion of Variance 0.007940 0.007870 0.00753 0.007210 0.007120 0.006790
## Cumulative Proportion  0.929800 0.937680 0.94521 0.952420 0.959530 0.966320
##                            PC37    PC38     PC39     PC40     PC41     PC42
## Standard deviation     4.696129 4.63991 4.263526 4.096386 3.923962 3.672711
## Proportion of Variance 0.006290 0.00614 0.005180 0.004780 0.004390 0.003850
## Cumulative Proportion  0.972610 0.97875 0.983940 0.988720 0.993110 0.996960
##                            PC43         PC44
## Standard deviation     3.266775 1.981253e-14
## Proportion of Variance 0.003040 0.000000e+00
## Cumulative Proportion  1.000000 1.000000e+00
# biplot(prcomp(hilic_pos, scale = TRUE)) # You can start drawing this and then
# press stop before R overlays all the loadings and you don't see anything
# anymore...  The values to be plotted are in sponge_pca$x

# This is the start of the data frame
sponge_pca$x[1:5, 1:5]
##             PC1       PC2        PC3        PC4         PC5
## Gb1  -4.7747015 -34.80680  0.5494318 -3.3129347   0.1732553
## Gb10 -3.4224414 -39.37541  3.4332047  0.8168009   5.0213546
## Gb12 -2.2413883 -25.15938 -5.5344854  6.5189835 -13.3513944
## Gb13  0.3669896 -25.56697 -9.1355455 17.3768955  -8.5400717
## Gb15 -4.1869640 -33.33305 -0.1358031  0.1974725  -2.8626989

sponge_pca$x are the scores, the values for the new principal components to describe our samples. We’ll use these to plot the PCA. You can do this already. Make a plot with the first two principal components using ggplot2.

Hint:

# 'sponge_pca$x' is not a data frame and so ggplot will complain.  you can make
# it a data frame with this simple command: data.frame()

Solution:

pca_plot <- data.frame(sponge_pca$x)
ggplot(pca_plot, aes(x = PC1, y = PC2)) + geom_point()

Yippie! To make this meaningful, let’s use the metadata. We can retrieve the unified_ID from the rownames of the data frame and use the text replacement command gsub() to replace all numbers of any length ’[0-9]*’ with nothing ’’. We then can set e.g. the shape or color of the dots with it. Also, let’s add the information about the variation per PC to the axes.

Solution:

k <- pca_summary$importance
x_lab <- paste("PC1", round(k[2, 1], digits = 3) * 100, "%")  # we're taking the value for PC1, transform it into percent, and make it a pasteable text string.
y_lab <- paste("PC2", round(k[2, 2], digits = 3) * 100, "%")  # we're taking the value for PC2, transform it into percent, and make it a pasteable text string.

pca_plot["unified_ID"] <- rownames(pca_plot)
pca_plot["species"] <- gsub("[0-9]*", "", pca_plot$unified_ID)

ggplot(pca_plot, aes(x = PC1, y = PC2)) + geom_point(size = 3, aes(shape = factor(species))) +
    labs(x = x_lab, y = y_lab) + theme(legend.position = "bottom")

Fantastic! Well done =) You’ve come so far!

To see if you’ve got it, can you produce a plot with PC2 and PC3?

k <- pca_summary$importance
x_lab <- paste("PC2", round(k[2, 2], digits = 3) * 100, "%")  # we're taking the value for PC1, transform it into percent, and make it a pasteable text string.
y_lab <- paste("PC3", round(k[2, 3], digits = 3) * 100, "%")  # we're taking the value for PC2, transform it into percent, and make it a pasteable text string.

pca_plot["unified_ID"] <- rownames(pca_plot)
pca_plot["species"] <- gsub("[0-9]*", "", pca_plot$unified_ID)

ggplot(pca_plot, aes(x = PC2, y = PC3)) + geom_point(size = 3, aes(shape = factor(species))) +
    labs(x = x_lab, y = y_lab) + theme(legend.position = "bottom")

From here if you feel that you want to continue, there is more things we can do with this data. We can add a third axis (PC3) and produce interactive 3D plots. To have one more nice aspect to work with, I’ll add sample depth to the data frame.

pca_plot <- left_join(pca_plot, meta_data[, c("unified_ID", "Depth")])

library(plotly)

# format background and axes

axx <- list(backgroundcolor = "rgb(211,211,211)", gridcolor = "rgb(255,255,255)",
    title = "PC1", showbackground = TRUE)
axy <- list(backgroundcolor = "rgb(211,211,211)", gridcolor = "rgb(255,255,255)",
    title = "PC2", showbackground = TRUE)
axz <- list(backgroundcolor = "rgb(211,211,211)", gridcolor = "rgb(255,255,255)",
    title = "PC3", showbackground = TRUE)

pca_3D <- plot_ly(pca_plot, x = ~pca_plot$PC1, y = ~pca_plot$PC2, z = ~pca_plot$PC3,
    symbol = ~pca_plot$species, symbols = c("diamond", "x", "circle"), color = ~pca_plot$Depth) %>%
    add_markers() %>%
    layout(scene = list(xaxis = axx, yaxis = axy, zaxis = axz))
pca_3D
# you can saving this as html file to open in you browser like so: f<-
# basename(tempfile('pca_3D_plotly', '.', '.html')) on.exit(unlink(f), add =
# TRUE) html <- htmlwidgets::saveWidget(pca_3D, f)

If you’re interested in metabolomics, check out the fantastic package ropls And now go and take a break! You deserve it!